Accelerating Science Discovery - Join the Discussion

OSTIblog Articles in the web crawling Topic

Federated Search - The Wave of the Future?: Part 2

by Dr. Walt Warnick 13 Mar, 2008 in Technology

by Walt Warnick and Sol Lederman

This is the second in a three-part series of articles about the deficiencies of web crawling and indexing, the superiority of federated search for the serious researcher, and the value of OSTI federated search applications in advancing science. Part 1 identified a number of serious limitations of Google and the other crawlers. This article shows how federated search overcomes these limitations. The final article in the series highlights a number of federated search applications and databases that OSTI makes available to the public.

In Part 1, we explained that Google, being a surface web crawler, cannot access the deep web, which consists of content that resides in databases. We noted that the deep web is several hundred times larger than the surface web and that a large percentage of the highly sought-after scientific and technical information resides there. We also explained that there is no way to determine the quality of any particular document in the surface web: any web citizen can post a document to the web, and it will likely be indexed.

Federated search applications overcome the two aforementioned limitations of surface crawlers: (1) limited access to content, and (2) the difficulty of determining its quality. Limited access is overcome by the federated search engine's specialized knowledge of how to query a database and how to retrieve its documents. The quality concern is overcome by the complementary efforts of database owners and creators of federated search applications. First, databases that are made available to federated search applications are managed by owners, or organizations, who have criteria for...

Related Topics: doe, federated search, osti, web crawling


Federated Search - The Wave of the Future?: Part 1

by Dr. Walt Warnick 12 Mar, 2008 in Technology

by Walt Warnick and Sol Lederman

The web is growing.

Google and the other web crawlers can't keep up when it comes to providing searchable access to the content that matters most to scientists and researchers. Instead, growing numbers of scientists, researchers, and science-attentive citizens turn to OSTI's federated search applications for high-quality research material that Google can't find. And, given fundamental limitations on how web crawlers find content, those conducting research will derive even more benefit from OSTI's innovation and investment in federated search in the coming years.

This is the first of three articles that discuss and compare the strengths and weaknesses of two web search architectures: the crawling and indexing architecture as used today by Google and the federated search architecture used by Science.gov and WorldWideScience.org. This article points out the limitations of the crawling architecture for serious researchers. The second article explains how federated search overcomes these obstacles. The third article highlights a number of OSTI's federated search offerings that advance science, and suggests that federated search may someday become the dominant web search architecture. 

Google is a "surface web" crawler; it discovers content by taking a list of known web pages and following links to new web pages and to documents. This approach finds only documents that have links referencing them. It finds none of the content in the "deep web," which accounts for the majority of web content.
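The link-following approach described above can be illustrated with a minimal sketch. It is not how Google or any real crawler is implemented; the page names and the `LINKS` graph are hypothetical, and the "crawl" runs over an in-memory dictionary rather than the live web. It only shows why a page that no other page links to is never discovered.

```python
from collections import deque

# Hypothetical link graph (an assumption for illustration): page -> pages it links to.
LINKS = {
    "home": ["about", "papers"],
    "about": ["home"],
    "papers": ["paper-1"],
    "paper-1": [],
    # This page exists, but nothing links to it, for example a record
    # that is only returned in response to a database query.
    "db-record-42": [],
}

def crawl(seeds, links):
    """Breadth-first crawl: start from known pages and follow links to new ones."""
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

found = crawl(["home"], LINKS)
unreached = sorted(set(LINKS) - set(found))
print("found:", found)
print("never discovered:", unreached)
```

In this toy graph the crawl reaches every linked page but never sees `db-record-42`, which stands in for deep web content.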

The deep web...

Related Topics: doe, federated search, osti, web crawling
